Takeaway: We can use the methods below, but each has caveats:
- Generate & Test (see code) - Works if the input space is small or the function is complex.
- Calculus - Assumes the function has a derivative and that setting it to zero is solvable.
- Newton's method (gradient ascent) - Assumes the function has a derivative, but expects a single optimum (can get stuck in local optima).
These assumptions don't hold if:
- Big input space (G&T)
- Complex function (Calculus, Newton)
- No derivative (or hard to find) (Calculus, Newton)
- Possibly many local optima (Newton)
The solution is to use Randomized Optimization.
Hill Climbing (Non-random)¶
Algorithm (steepest ascent hill climbing):
- Find the neighbor with the highest function value.
- If that neighbor's value is above the current value, move to the new point.
- Otherwise, stop (we're at a local or global optimum).
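The steps above can be sketched in a few lines. This is an illustrative toy (the function `f`, start point, and step-size-1 `neighbors` are assumptions for the example, not from the lecture):

```python
def steepest_ascent_hill_climb(f, x0, neighbors):
    """Steepest-ascent hill climbing: repeatedly move to the best neighbor."""
    x = x0
    while True:
        best = max(neighbors(x), key=f)   # neighbor with the highest function value
        if f(best) > f(x):                # strictly better -> move there
            x = best
        else:                             # no improving neighbor -> local/global optimum
            return x

# Toy example: maximize f(x) = -(x - 7)^2 over the integers.
f = lambda x: -(x - 7) ** 2
result = steepest_ascent_hill_climb(f, x0=0, neighbors=lambda x: [x - 1, x + 1])
print(result)  # 7
```

Since `f` here has a single peak, the climb always finds the global optimum; the sections below deal with what happens when it doesn't.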

Quiz: Guess my word/bits¶

Takeaways: Well-behaved problem. This problem has only one optimum.
Random Restart Hill Climbing¶
Algorithm¶
Once local optimum reached, try again starting from a randomly chosen x.
- Multiple tries to find a good starting place.
- Not much more computationally expensive (constant factor).
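The restart loop can be sketched as below. The multimodal `f` (a local peak at $x=2$, global peak at $x=10$) and the start-point range are made up for illustration:

```python
import random

def random_restart_hill_climb(f, neighbors, random_start, n_restarts=20):
    """Hill climbing from several random starts; keep the best result found.
    Only a constant factor (n_restarts) more expensive than a single climb."""
    best = None
    for _ in range(n_restarts):
        x = random_start()
        while True:                      # plain steepest-ascent climb
            nb = max(neighbors(x), key=f)
            if f(nb) > f(x):
                x = nb
            else:
                break
        if best is None or f(x) > f(best):
            best = x                     # remember the best local optimum so far
    return best

# Toy multimodal function: local optimum at x=2 (f=0), global optimum at x=10 (f=16).
random.seed(0)
f = lambda x: -(x - 2) ** 2 if x < 6 else 16 - (x - 10) ** 2
best = random_restart_hill_climb(
    f, neighbors=lambda x: [x - 1, x + 1],
    random_start=lambda: random.randint(0, 15))
print(f(best))
```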
Issues¶
If the range of the input space that leads to the global optimum is narrow (i.e., the green zone in the image), then there is a high chance that every random restart lands at a point that only reaches the local optimum (i.e., the yellow zone).

Quiz: Random Hill Climbing Eval¶
This is a complicated answer, watch video for full info: https://www.youtube.com/watch?v=Ts9vLVxyqdY

Takeaway: The success of random-restart hill climbing depends a lot on the size of the basin of attraction (the area whose points lead toward the global optimum). If it is narrow, it won't do much better than trying every point in $X$.
Simulated Annealing¶
Motivation¶
Idea: "Don't always improve (exploit) - sometimes you need to search (explore) with the hope of finding something even better."
- Hill climbing can be considered 100% exploitation, which is more likely to lead to local optima. It's like "believing too much" in your data, which relates to overfitting.
Algorithm¶
From CMU notes:
The algorithm is basically hill-climbing except instead of picking the best move, it picks a random move. If the selected move improves the solution, then it is always accepted. Otherwise, the algorithm makes the move anyway with some probability less than 1. The probability decreases exponentially with the "badness" of the move, which is the amount $\Delta E$ by which the solution is worsened.
Simulated annealing is typically used in discrete, but very large, configuration spaces, such as the set of possible orders of cities in the Traveling Salesman problem and in VLSI routing.

Key Ideas:
- If $f(x_t)$ is greater than $f(x)$, we always accept and move to $x_t$.
- If T (temperature) goes to zero, it behaves like hill climbing. If T goes to infinity, it behaves like a random walk. Why? For a worse move ($f(x_t) < f(x)$), the acceptance probability is $e^{\frac{f(x_t) - f(x)}{T}}$:
- Hill Climbing: If T is low, the exponent $\frac{f(x_t) - f(x)}{T}$ goes to $-\infty$ (the numerator is negative), so $e^{\frac{f(x_t) - f(x)}{T}}$ goes to zero. Zero means we reject $x_t$.
- Random Walk: If T is high, the numerator doesn't matter. $\frac{f(x_t) - f(x)}{T}$ goes to zero, and $e^0 = 1$. 1 means we accept $x_t$.
- Metallurgy analogy: "At high temperature, molecules bounce around (random). At cold temperature, molecules stay put (hill climb)".
- How to apply? Decrease $T$ slowly over time.
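Putting the acceptance rule and a slow cooling schedule together gives a short sketch. The toy function, the geometric cooling factor, and the step counts are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, neighbor, T0=10.0, cooling=0.99, steps=5000):
    """Accept worse moves with probability e^{(f(x_t) - f(x)) / T}; cool T slowly."""
    x, T = x0, T0
    for _ in range(steps):
        x_t = neighbor(x)
        dE = f(x_t) - f(x)
        # Always accept improvements; accept worse moves with prob e^{dE/T}.
        if dE > 0 or random.random() < math.exp(dE / T):
            x = x_t
        T = max(T * cooling, 1e-6)  # geometric cooling schedule (assumption)
    return x

# Same toy multimodal function: local peak at x=2 (f=0), global peak at x=10 (f=16).
random.seed(1)
f = lambda x: -(x - 2) ** 2 if x < 6 else 16 - (x - 10) ** 2
x_final = simulated_annealing(f, x0=0, neighbor=lambda x: x + random.choice([-1, 1]))
```

Early on (high T) the search wanders like a random walk; by the end (tiny T) it only accepts improvements, so it finishes at some local optimum.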
Properties of Simulated Annealing¶
- Probability of ending at x is most likely at areas of high fitness $f(x)$
- As $T$ goes down, the formula acts like a max, putting all the probability mass on $x^*$
- As $T$ goes up, the distribution becomes more random.
- Formula is similar to the Boltzmann Distribution:
$$
P(\text{ending at x}) = \frac{e^{\frac{f(x)}{T}}}{Z_T}
$$
Genetic Algorithms¶
Reference:
- Mitchell, Ch.9 "Genetic Algorithms" p.249
Motivation¶
Terminology:
- Population: Group of individual search points.
- Mutation: Local search of each individual along their neighborhoods.
- Crossover - Takes different points of individuals and combines them together to find a more optimal location.
- Generations - Iterations of improvement across the population.

Takeaways:
- Like random-restart hill climbing, but run in parallel over a population.
- Crossover - Population shares information with each other.
Algorithm¶
- $P_0$ : Initialize population of size $K$
- Repeat until converged:
- Compute fitness over all individuals in population $x \in P_t$
- Select "most fit" individuals. How?
- "Truncation Selection" - Top half of those that returned the best value given $f(x)$. Weighted towards exploitation.
- "Roulette Wheel" - Select individuals at random, but give x's with higher scores a higher probability of being selected. Weighted towards exploration.
- Pair up the most fit individuals and create new points via crossover/mutation. New offspring replace the "least fit" individuals.
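The loop above, with truncation selection, can be sketched on the classic OneMax problem (maximize the number of 1 bits). Population size, mutation rate, and generation count are illustrative assumptions:

```python
import random

def genetic_algorithm(f, n_bits=20, pop_size=40, generations=60, p_mut=0.02):
    """Tiny GA sketch: truncation selection, 1-point crossover, bit-flip mutation."""
    random.seed(0)
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f, reverse=True)          # compute fitness; most fit first
        parents = pop[: pop_size // 2]         # truncation selection (top half)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)   # pair up fit individuals
            cut = random.randrange(1, n_bits)  # 1-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = parents + children               # offspring replace the least fit
    return max(pop, key=f)

# OneMax: fitness = number of 1 bits; the optimum is the all-ones string.
best = genetic_algorithm(f=sum)
```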
Crossover Example¶

Takeaways:
- 1-point crossover and inductive bias: If split at index, locality of bits matter (bits closer together, stay together). Bias because location of information might matter.
- Uniform crossover: Exact same distribution over each pair, locality doesn't matter.
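The two crossover variants differ only in how positions are assigned to parents; a minimal sketch (parent bit strings chosen for illustration):

```python
import random

def one_point_crossover(a, b):
    """Split both parents at one index: bits that are close together stay together."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def uniform_crossover(a, b):
    """Each position independently inherits from either parent: locality doesn't matter."""
    mask = [random.randint(0, 1) for _ in a]
    c1 = [x if m else y for m, x, y in zip(mask, a, b)]
    c2 = [y if m else x for m, x, y in zip(mask, a, b)]
    return c1, c2

random.seed(0)
p1, p2 = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
kids_1pt = one_point_crossover(p1, p2)
kids_uni = uniform_crossover(p1, p2)
```

In both cases each position's pair of bits is just redistributed between the two children; the difference is whether contiguous blocks survive (1-point) or not (uniform).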
Summary: RO pt.1¶
- Randomized Optimization: Random steps, start in random places. Useful if no gradient pointing the way.
- Types: Hill climbing, Hill climbing + random restarts.
- Simulated annealing: Temperature, captures probability distribution.
- Genetic algorithms: Population, crossover, parallel random hill climbing
- Overall, these algorithms are memoryless.
MIMIC¶
Reference: De Bonet, Isbell, Viola, "MIMIC: Finding Optima by Estimating Probability Densities", 1997
Motivation¶
- The other algorithms were memoryless, and no structure (just points) was communicated.
- It was unclear what probability distribution we were dealing with.
How? Directly model a probability distribution and successively refine it over time. This conveys the structure of the search space and its best points.

Algorithm¶
- Generate samples from the distribution at time step $t$: $P^{\theta_t}(x)$
- Set $\theta_{t+1}$ to the $n$th percentile of the sampled fitness values (e.g., the 50th percentile)
- Like the Genetic Algorithm, choose those that are the "best fits"
- Retain only those samples satisfying $f(x) \geq \theta_{t+1}$
- Estimate $P^{\theta_{t+1}}(x)$
- Repeat until convergence
- Ideally we will move from $\theta_{min}$ to $\theta_{max}$
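The loop can be sketched with a deliberately simplified density model: independent per-bit marginals instead of the dependency trees the actual MIMIC paper fits. Sample sizes, the clamping of probabilities, and the OneMax objective are all assumptions for illustration:

```python
import random

def mimic_sketch(f, n_bits=10, n_samples=200, iters=30):
    """Simplified MIMIC-style loop using independent per-bit marginals.
    (Real MIMIC estimates a dependency tree over the features instead.)"""
    random.seed(0)
    p = [0.5] * n_bits                       # uniform initial distribution P^{theta_0}
    for _ in range(iters):
        samples = [[int(random.random() < pi) for pi in p] for _ in range(n_samples)]
        samples.sort(key=f, reverse=True)
        elite = samples[: n_samples // 2]    # keep samples with f(x) >= theta_{t+1} (median)
        # Re-estimate the distribution from the retained samples (clamped away from 0/1).
        p = [min(max(sum(x[i] for x in elite) / len(elite), 0.02), 0.98)
             for i in range(n_bits)]
    return [int(pi > 0.5) for pi in p]

best = mimic_sketch(f=sum)   # OneMax again: the marginals drift toward all ones
```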
Challenges:
- Assumes we can estimate $P^{\theta_{t+1}}(x)$.
- Assumes $P^\theta$ and $P^{\theta+\epsilon}$ are close, i.e., the current sample already contains samples from the next distribution we're looking for.
- The higher $\theta$ becomes, the harder it is to represent the distribution over the input space.
Estimating Distributions¶

- Example of a joint probability distribution (ref).
Finding Dependency Trees¶
NOTE: I got lost on the math. Revisit this lecture to refresh: https://www.youtube.com/watch?v=R-Mf9-tKC5o.

Takeaway: The best dependency tree we can find is the one that minimizes the sum of the conditional entropies of each feature given its parent.
- Looking at it another way: finding the best dependency tree is equivalent to maximizing the mutual information between every feature and its parent (last equation). The best dependency tree is the one that captures the dependencies best.